Overview

This is an example analysis of a mock dataset using R notebooks. R notebooks let you create dynamic documents that combine R code, its output, and Markdown-formatted text.

library(tidyverse)
library(cowplot)
library(plotly)
library(broom)

Import raw data

We will use the read_csv() function to read in the raw .csv file that is hosted on my GitHub account. Notice how we use the assignment operator to store the result of read_csv() in a 'data' object. When you call it, you can see what the data look like. Notice that it's in the 'wide' format.

file <- "https://raw.githubusercontent.com/nguyens7/nguyens7.github.io/master/data/Antibiotics.csv"
data <- read_csv(file)
data

Making wide format into long format

Next we will use the gather() function to make the data 'long'. Here we create two new columns, Bacteria and Count, and reshape the 'wide' data into the 'long' format.

data2 <- data %>%
  gather(Bacteria, Count, 2:10)
data2
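To see what the wide-to-long reshape is doing, here is a minimal base-R sketch on a hypothetical toy data frame (the column names and values are made up for illustration, not taken from the antibiotics data):

```r
# Hypothetical toy data in 'wide' format: one measurement column per condition
wide <- data.frame(Plate      = c("A", "B", "C"),
                   Exp1_Ecoli = c(10, 12, 11),
                   Exp1_Staph = c(5, 6, 4))

# Base-R sketch of what gather(Bacteria, Count, 2:3) produces:
# each measurement column becomes rows of (key, value) pairs
long <- data.frame(Plate    = rep(wide$Plate, times = 2),
                   Bacteria = rep(names(wide)[2:3], each = nrow(wide)),
                   Count    = c(wide$Exp1_Ecoli, wide$Exp1_Staph))

long  # 6 rows: 3 plates x 2 measurement columns
```

Every row of the long frame is now one observation, which is what ggplot2 and dplyr expect.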

Separate the Bacteria Column into Experiment and Organism

Now that the data is in the long format, we can use the separate() function to split values from within a column. We'll create two new columns named Experiment and Organism and tell R to separate on the underscore ("_").

data3 <- data2 %>% 
  separate(Bacteria, into = c("Experiment", "Organism"), sep = "_")

#Make the Treatment and Organism columns as categorical 'factors'
data3$Treatment <- as.factor(data3$Treatment)
data3$Organism <- as.factor(data3$Organism)
data3
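Under the hood, separating on "_" is just a string split. A minimal base-R sketch, using hypothetical values of the same "Experiment_Organism" shape:

```r
# Hypothetical column values of the form "Experiment_Organism"
bacteria <- c("Exp1_Ecoli", "Exp2_Staph")

# separate(Bacteria, into = c("Experiment", "Organism"), sep = "_")
# is essentially a split on the underscore:
parts      <- strsplit(bacteria, "_", fixed = TRUE)
experiment <- sapply(parts, `[`, 1)
organism   <- sapply(parts, `[`, 2)

experiment  # "Exp1" "Exp2"
organism    # "Ecoli" "Staph"
```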

Average within each experimental replicate

Next we'll use the group_by() function to 'lock in' specific categories and then use the summarise()/summarize() function to collapse the 'long' data into a smaller data frame. Here we average the 5 technical replicates within each experiment and calculate the standard deviation and standard error of the mean.

data4 <- data3 %>% 
  group_by(Organism, Treatment, Experiment) %>% 
  summarise(N    = length(Count),
            mean = mean(Count),
            sd   = sd(Count),
            se   = sd / sqrt(N))
data4
#write_csv(data4,"experimental_summary.csv")
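The summary statistics computed inside summarise() are plain base-R arithmetic. A self-contained sketch on a hypothetical set of 5 technical replicates for one Organism/Treatment/Experiment cell:

```r
# Hypothetical counts for one Organism/Treatment/Experiment cell
counts <- c(10, 12, 11, 9, 13)   # 5 technical replicates

N  <- length(counts)
m  <- mean(counts)               # 11
s  <- sd(counts)                 # sample standard deviation
se <- s / sqrt(N)                # standard error of the mean

c(N = N, mean = m, sd = s, se = se)
```

group_by() + summarise() simply applies this recipe once per group.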

Average all the experiments

We'll do the same thing again, but this time average across the three experimental replicates.

data5 <- data4 %>% 
  group_by(Organism, Treatment) %>% 
  summarise(avg_N   = length(mean),
            average = mean(mean),
            avg_sd  = sd(mean),
            avg_se  = avg_sd / sqrt(avg_N))
data5  
#write_csv(data5,"final_summary.csv")

Graphing

Boxplot

We can create a box plot to get an idea of how the data points are distributed. We start by defining what we want on our x-axis and y-axis, and we can also map color to Organism. Next we add a boxplot and points to the graph. Finally, we use the facet_wrap() function to split the plot into panels, in this case by Treatment.

boxplot <- data3 %>% # no group_by() needed; aes() handles the grouping
  ggplot(aes(x = Organism, y = Count, color = Organism))+
  geom_boxplot(colour="black", fill=NA) + 
  geom_point(position= 'jitter', size=2) +
  ylab("\nColony Count\n") + # Y axis label
  ggtitle("Effect of Antibiotic on Bacterial Growth") + #title
  facet_wrap(~Treatment)

boxplot

Interactive boxplot

We can also use the plotly package to make the graphs interactive.

ggplotly(boxplot)

Bar graph, facet by experiment

We can use the ggplot() function to create a bar graph and add error bars. Notice how our tidy data makes this straightforward to plot.

#Bar graph of experiments
Experimental_Replicates <- data4 %>% 
  ggplot(aes(x=Organism,y=mean,fill=Treatment))+
  geom_col(position="dodge")+
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.4, 
                size=0.8, colour="black", position=position_dodge(.9)) + #error bars
  scale_y_continuous(expand=c(0,0))+ #set bottom of graph
  xlab("Organism") + # X axis label
  ylab("\nColony Count\n") + # Y axis label
  ggtitle("Effect of Antibiotic on Bacterial Growth")+ #title
  facet_wrap(~Experiment)

Experimental_Replicates

Final plot

#Final graph
Final_plot <- data5 %>% 
  ggplot(aes(x=Organism,y=average,fill=Treatment))+
  geom_col(position="dodge")+
  geom_errorbar(aes(ymin=average-avg_se, ymax=average+avg_se), width=.4, 
                size=0.8, colour="black", position=position_dodge(.9)) + #error bars
  scale_y_continuous(expand=c(0,0))+ #set bottom of graph
  xlab("Organism") + # X axis label
  ylab("\nColony Count\n") + # Y axis label
  ggtitle("Effect of Antibiotic on Bacterial Growth")

Final_plot

Statistics

R is really useful for performing statistical analysis on your data. For this mock dataset, I chose to run a Shapiro-Wilk normality test, which checks whether the data meet the normality assumption required by parametric tests. I use the shapiro.test() function on the 'Count' column of the data3 data frame. Then I use the tidy() function from broom to turn the test result into a neat data frame.

#Shapiro-Wilk normality test
shapiro <- shapiro.test(data3$Count)
tidy(shapiro)

We fail to reject H0, so the data are consistent with a normal distribution.

ANOVA

Next we can perform an ANOVA, which follows a simple formula syntax. Use the aov() function and specify the numerical response, then '~', then the explanatory variables you want to compare; in our case we model mean as a function of Organism and Treatment (and their interaction).

ANOVA <- aov(mean~(Organism*Treatment),data=data4)
tidy(ANOVA)
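The same formula syntax works on any data frame. A minimal self-contained sketch with hypothetical toy data (summary() is the base-R alternative to broom::tidy()):

```r
# Hypothetical toy data: two organisms x two treatments, 3 values each
toy <- data.frame(
  Organism  = rep(c("Ecoli", "Staph"), each = 6),
  Treatment = rep(rep(c("Control", "Antibiotic"), each = 3), times = 2),
  mean      = c(10, 11, 12, 2, 3, 2,  8, 9, 8, 7, 8, 9)
)

# response ~ factor1 * factor2 fits both main effects and their interaction
fit <- aov(mean ~ Organism * Treatment, data = toy)
summary(fit)   # base-R alternative to broom::tidy(fit)
```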

Tukey HSD Post Hoc Test

We can then perform a post hoc test using the TukeyHSD() function on the fitted ANOVA.

#Tukey HSD
HSD <- TukeyHSD(ANOVA)
tukey <- tidy(HSD)
tukey
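TukeyHSD() needs a fitted aov object; it then reports every pairwise comparison with a confidence interval and an adjusted p-value. A self-contained one-factor sketch on hypothetical data:

```r
# Hypothetical one-factor toy data with three groups
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 4)),
                  y     = c(1, 2, 1, 2,  5, 6, 5, 6,  9, 8, 9, 8))

fit <- aov(y ~ group, data = toy)
hsd <- TukeyHSD(fit)
hsd$group   # pairwise differences (b-a, c-a, c-b) with 'p adj' column
```

broom::tidy() flattens this per-term matrix into one long data frame, which is what makes the filtering step below convenient.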

Sorting significant Tukey HSD results

We can filter for adjusted p values less than 0.05 and arrange them in ascending order.

#Aggregate significant results
sig.tukey <- tukey %>% 
  filter(adj.p.value<0.05) %>% 
  arrange(adj.p.value)
sig.tukey
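The filter/arrange pair maps directly onto base-R subsetting and ordering. A sketch on a hypothetical tidied Tukey table (contrast names and p-values invented for illustration):

```r
# Hypothetical tidied Tukey results
tukey_df <- data.frame(contrast    = c("b-a", "c-a", "c-b"),
                       adj.p.value = c(0.20, 0.001, 0.03))

# filter(adj.p.value < 0.05) then arrange(adj.p.value), in base R:
sig <- tukey_df[tukey_df$adj.p.value < 0.05, ]
sig <- sig[order(sig$adj.p.value), ]
sig$contrast  # "c-a" "c-b"
```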